Method for partitioning a block of data into subblocks and for storing and communicating such subblocks

ABSTRACT

This invention provides a method and apparatus for detecting common spans within one or more data blocks by partitioning the blocks (FIG. 4) into subblocks and searching the group of subblocks (FIG. 12) (or their corresponding hashes (FIG. 13)) for duplicates. Blocks can be partitioned into subblocks using a variety of methods, including methods that place subblock boundaries at fixed positions (FIG. 3), methods that place subblock boundaries at data-dependent positions (FIG. 3), and methods that yield multiple overlapping subblocks (FIG. 6). By comparing the hashes of subblocks, common spans of one or more blocks can be identified without ever having to compare the blocks or subblocks themselves (FIG. 13). This leads to several applications including an incremental backup system that backs up changes rather than changed files (FIG. 25), a utility that determines the similarities and differences between two files (FIG. 13), a file system that stores each unique subblock at most once (FIG. 26), and a communications system that eliminates the need to transmit subblocks already possessed by the receiver (FIG. 19).

INTRODUCTION

The present invention provides a method and apparatus for partitioning one or more blocks of data into subblocks for the purpose of communicating and storing such subblocks in an efficient manner.

BACKGROUND

Much of the voluminous amount of information stored, communicated, and manipulated by modern computer systems is duplicated within the same or a related computer system. It is commonplace, for example, for computers to store many slightly differing versions of the same document. It is also commonplace for data transmitted during a backup operation to be almost identical to the data transmitted during the previous backup operation. Computer networks also must repeatedly carry the same or similar data in accordance with the requirements of their users.

Despite the obvious benefits that would flow from a reduction in the redundancy of communicated and stored data, few computer systems perform any such optimization. Some instances can be found at the application level, one example being the class of incremental backup utilities that save only those files that have changed since the most recent backup. However, even these utilities do not attempt to exploit the significant similarities between old and new versions of files, and between files sharing other close semantic ties. This kind of redundancy can be approached only by analysing the contents of the files.

The present invention addresses the potential for reducing redundancy by providing an efficient method for identifying identical portions of data within a group of blocks of data, and for using this identification to increase the efficiency of systems that store and communicate data.

SUMMARY OF THE INVENTION

To identify identical portions of data within a group of blocks of data, the blocks must be analysed. One simple approach is to divide the blocks into fixed-length (e.g. 512-byte) subblocks and compare these with each other so as to identify all identical subblocks. This knowledge can then be used to manage the blocks in more efficient ways.

Unfortunately, the partitioning of blocks into fixed-length subblocks does not always provide a suitable framework for the recognition of duplicated portions of data, as identical portions of data can occur in different sizes and places within a group of blocks of data. FIG. 1 shows how division into fixed-size subblocks of two blocks (whose only difference is the insertion of a single byte (`X`)) fails to generate identical subblocks. A comparison of the two groups of subblocks would reveal no identical pairs of subblocks even though the two original blocks differ by just one character.
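The misalignment effect can be reproduced in a few lines. The following is a minimal sketch only; the function name `fixed_partition`, the 4-byte subblock size, and the sample data are illustrative assumptions and are not taken from FIG. 1.

```python
def fixed_partition(block: bytes, size: int) -> list[bytes]:
    """Split a block into consecutive subblocks of `size` bytes (the last may be shorter)."""
    return [block[i:i + size] for i in range(0, len(block), size)]

original = b"ABCDEFGHIJKLMNOP"
modified = b"X" + original            # a single byte inserted at the front

# Every subblock of the modified block is shifted by one byte,
# so the two groups of subblocks have no member in common.
shared = set(fixed_partition(original, 4)) & set(fixed_partition(modified, 4))
print(shared)                         # set()
```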

A better approach is to partition each block using the data in the block itself to determine the position of the partitions.

In an aspect of the invention, the blocks are partitioned at boundaries determined by the content of the data itself. For example, a block could be partitioned at each point at which the preceding three bytes hash to a particular constant value. FIG. 2 shows how such a data-dependent partitioning could turn out, and contrasts it with a fixed-length partitioning. In FIG. 3 data-independent partitioning generates seven distinct subblocks whereas the data-dependent partitioning generates just four, allowing much of the similarity between the two blocks to be detected.

The fact that a partitioning is data dependent does not imply that it must incorporate any knowledge of the syntax or semantics of the data. So long as the boundaries are positioned in a manner dependent on the local data content, identical subblocks are likely to be formed from identical portions of data, even if the two portions are not identically aligned relative to the start of their enclosing blocks (FIG. 3).
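A data-dependent partitioning of this kind might be sketched as follows. This is an illustration under stated assumptions, not the method of any particular embodiment: `toy_hash` (a plain byte sum), the modulus of 4, the target value 0, and the sample data are all hypothetical choices made so that the result is easy to check by hand.

```python
def toy_hash(window: bytes) -> int:
    """Illustrative stand-in for a real hash function (assumption: a byte sum)."""
    return sum(window)

def partition(block: bytes, window: int = 3, modulus: int = 4, target: int = 0) -> list[bytes]:
    """Place a boundary before offset k whenever the `window` bytes preceding k hash to `target`."""
    subblocks, start = [], 0
    for k in range(window, len(block)):
        if toy_hash(block[k - window:k]) % modulus == target:
            subblocks.append(block[start:k])
            start = k
    subblocks.append(block[start:])
    return subblocks

original = bytes(range(1, 21))
modified = b"\x63" + original         # one byte inserted at the front

p1, p2 = partition(original), partition(modified)
# Only the first subblock absorbs the inserted byte; the boundaries after it
# move with the data, so the remaining subblocks are identical.
print(len(set(p1) & set(p2)), "of", len(p1), "subblocks are shared")   # 4 of 5 subblocks are shared
```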

Once the group of blocks has been partitioned into subblocks, the resulting group of subblocks can be manipulated in a manner that exploits the occurrence of duplicate subblocks. This leads to a variety of applications, some of which are described below. However, the application of a further aspect of the invention leads to even greater benefits.

In a further aspect of the invention, the hash of one or more subblocks is calculated. The hash function can be an ordinary hash function or one providing cryptographic strength. The hash function maps each subblock into a small tractable value (e.g. 128 bits) that provides an identity of the subblock. These hashes can usually be manipulated more efficiently than their corresponding subblocks.
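As an illustration, the sketch below maps each subblock to a 128-bit value and finds common subblocks by comparing only those values. The choice of BLAKE2b truncated to 16 bytes is an assumption made for the example; the specification does not name a hash function, and `subblock_hash` is a hypothetical helper.

```python
import hashlib

def subblock_hash(subblock: bytes) -> bytes:
    """Return a 128-bit identity for a subblock (assumption: BLAKE2b, 16-byte digest)."""
    return hashlib.blake2b(subblock, digest_size=16).digest()

first = [b"alpha", b"beta", b"gamma"]
second = [b"beta", b"delta", b"gamma"]

# The subblocks themselves are never compared, only their hashes.
common = {subblock_hash(s) for s in first} & {subblock_hash(s) for s in second}
print(len(common))                    # 2
```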

Some applications of aspects of this invention are:

Fine-grained incremental backups: Conventional incremental backup technology uses the file as the unit of backup. However, in practice many large files change only slightly, resulting in a wasteful re-transmission of changed files. By storing the hashes of subblocks of the previous versions of files, the transmission of unchanged subblocks can be eliminated.

Communications: By providing a framework for communicating the hashes of subblocks, the invention can eliminate the transmission of subblocks already possessed by the receiver.

Differences: The invention could be used as the basis of a program that determines the areas of similarity and difference between two blocks.

Low-redundancy file system: Data stored in a file system can be partitioned into subblocks whose hashes can be compared so as to eliminate the redundant storage of identical subblocks.

Virtual memory: Virtual memory could be organized by subblock using a table of hashes to determine if a subblock is somewhere in memory.

Clarification of Terms

The terms block and subblock both refer, without limitation, to finite blocks or infinite blocks (sometimes called streams) of zero or more bits or bytes of digital data. Although the two different terms ("block" and "subblock") essentially describe the same substance (digital data), the two different terms have been employed in this specification to indicate the role that a particular piece of data is playing. The term "block" is usually used to refer to raw data to be manipulated by aspects of the invention. The term "subblock" is usually used to refer to a part of a block. "Blocks" are "partitioned" into "subblocks".

The term partition has its usual meaning of exhaustively dividing an entity into mutually exclusive parts. However, within this specification, the term also includes cases where:

Not all of the block is subdivided.

Multiple overlapping subblocks are formed.

A natural number is a non-negative integer (0, 1, 2, 3, 4, 5, . . . ).

Where the phrase zero or more is used, this phrase is intended to encompass the degenerate case where the objects being enumerated are not considered at all, as well as the case where zero or more objects are used.

BRIEF DESCRIPTION

The following aspects of this invention are numbered for reference purposes. The terms "block" and "subblock" refer to blocks and subblocks of digital data.

1. In an aspect of the invention, the invention provides a method for organizing a block b of digital data for the purpose of storage, communication, or comparison, by partitioning said block into subblocks at one or more positions k|k+1 within said block for which b[k-A+1 . . . k+B] satisfies a predetermined constraint, where A and B are natural numbers.

Note: The specification of this aspect encompasses the degenerate case in which either A or B is zero. The specification also includes the case where the constraint does not pay attention to some parts of b[k-A+1 . . . k+B]. For example, a constraint that pays attention only to (say) b[k-3] and b[k+2] would fall under the classes of constraint corresponding to A≧4 and B≧2.
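The roles of A and B can be made concrete with a short sketch. The function `boundaries` and the example constraint `ends_match` are hypothetical names introduced here; the code uses 0-based offsets, so offset k in the code corresponds to the position k|k+1 of the 1-based notation above, and the window block[k-A:k+B] corresponds to b[k-A+1 . . . k+B].

```python
from typing import Callable

def boundaries(block: bytes, A: int, B: int,
               constraint: Callable[[bytes], bool]) -> list[int]:
    """Return every interior offset k at which block[k-A:k+B] satisfies `constraint`."""
    first = max(A, 1)                 # A bytes must exist before the boundary
    last = len(block) - max(B, 1)     # B bytes must exist after it; the block end is skipped
    return [k for k in range(first, last + 1)
            if constraint(block[k - A:k + B])]

# A constraint that inspects only b[k-3] and b[k+2] (1-based), hence A = 4 and B = 2:
# these are the first and last bytes of the six-byte window.
def ends_match(window: bytes) -> bool:
    return window[0] == window[-1]

print(boundaries(b"abcdeabcde", 4, 2, ends_match))   # [4, 5, 6, 7, 8]
print(boundaries(b"abcdefghij", 4, 2, ends_match))   # []
```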

2. In a further aspect of the invention, the invention provides a method according to aspect 1 in which the constraint comprises the hash of some or all of b[k-A+1 . . . k+B].

3. In a further aspect of the invention, the invention provides a method according to aspect 1, for locating the nearest subblock boundary on a side of a position p|p+1 within a said block, comprising the step of:

a. Evaluating whether said predetermined constraint is satisfied at each position k|k+1, for increasing (or decreasing) k, where k starts with the value p.

4. In a further aspect of the invention, the invention provides a method according to aspect 1, wherein one or more bounds are imposed on the size of one or more subblocks.
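One possible way of imposing such bounds is sketched below, assuming a lower bound that suppresses boundaries and an upper bound that forces one. The names `partition_bounded` and `after_space`, and the policy itself, are illustrative assumptions; the specification does not prescribe how the bounds are enforced.

```python
from typing import Callable

def partition_bounded(block: bytes, is_boundary: Callable[[bytes, int], bool],
                      min_size: int, max_size: int) -> list[bytes]:
    """Partition `block`, ignoring boundaries that would create a subblock shorter
    than `min_size` and forcing a boundary once a subblock reaches `max_size`."""
    subblocks, start = [], 0
    for k in range(1, len(block)):
        size = k - start
        if size >= max_size or (size >= min_size and is_boundary(block, k)):
            subblocks.append(block[start:k])
            start = k
    subblocks.append(block[start:])   # the final subblock may be shorter than min_size
    return subblocks

# Hypothetical constraint for the example: a boundary follows every space byte.
def after_space(block: bytes, k: int) -> bool:
    return block[k - 1:k] == b" "

parts = partition_bounded(b"a b c dddddddddd", after_space, min_size=3, max_size=5)
print(parts)   # [b'a b ', b'c ddd', b'ddddd', b'dd']
```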

5. In a further aspect of the invention, the invention provides a method according to aspect 1, wherein additional subblocks are formed from one or more groups of subblocks.

6. In a further aspect of the invention, the invention provides a method according to aspect 1, wherein an additional hierarchy of subblocks is formed from one or more groups of contiguous subblocks.

7. In a further aspect of the invention, the invention provides a method according to one of aspects 1 to 6, comprising the further step of:

b. Calculating the hash of each of one or more of said subblocks.

Note: The resulting collection of hashes is particularly useful if the hash function is a strong one-way hash function.

8. In a further aspect of the invention, the invention provides a method according to one of aspects 1 to 6, comprising the further step of:

b. Forming a projection of said block, being an ordered or unordered collection of elements, wherein each element consists of a subblock, an identity of a subblock, or a reference of a subblock.

Note: The specification of this aspect is intended to admit collections that contain a mixture of various kinds of identities and references.

Note: In most applications, the output of this aspect will be an ordered list of hashes of the subblocks of the block.

9. In a further aspect of the invention, the invention provides a method for comparing one or more blocks, comprising the steps of:

a. Partitioning one or more of said blocks into one or more subblocks in accordance with one of aspects 1 to 6.

b. Forming a projection of each said block, being an ordered or unordered collection of elements, wherein each element consists of a subblock, an identity of a subblock, or a reference of a subblock.

c. Comparing the elements of said projections of said blocks.
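Steps b and c might be realised as in the following sketch, which assumes the blocks have already been partitioned and that each projection is an ordered list of subblock hashes. The use of SHA-256 and of the standard `difflib.SequenceMatcher` to align the two lists are assumptions of the example, as are the names `projection` and `common_runs`.

```python
import hashlib
from difflib import SequenceMatcher

def projection(subblocks: list[bytes]) -> list[bytes]:
    """Ordered list of subblock hashes (assumption: SHA-256)."""
    return [hashlib.sha256(s).digest() for s in subblocks]

def common_runs(proj1: list[bytes], proj2: list[bytes]) -> list[tuple[int, int, int]]:
    """Return (index in proj1, index in proj2, run length) for each run of equal hashes."""
    matcher = SequenceMatcher(None, proj1, proj2, autojunk=False)
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size]

block1 = [b"A", b"B", b"C", b"D"]
block2 = [b"X", b"A", b"B", b"C", b"D"]       # one subblock inserted at the front
print(common_runs(projection(block1), projection(block2)))   # [(0, 1, 4)]
```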

10. In a further aspect of the invention, the invention provides a method for representing one or more blocks, comprising:

(i) A collection of subblocks;

(ii) Block representatives (e.g. filenames) which are mapped to lists of entries that identify subblocks;

whereby the modification of one of said blocks involves the following steps:

a. Partitioning some or all of said modified block into subblocks in accordance with one of aspects 1 to 6;

b. Adding to said collection of subblocks zero or more subblocks that are not already in said collection, and updating said subblock list associated with said modified block.

11. In a further aspect of the invention, the invention provides a method according to aspect 10, in which step b is replaced by:

b. Removing from said collection of subblocks zero or more subblocks, and updating said subblock list associated with said modified block.

12. In a further aspect of the invention, the invention provides a method according to aspect 10, in which step b is replaced by:

b. Adding to said collection of subblocks zero or more subblocks that are not already in the collection, removing from said collection of subblocks zero or more subblocks, and updating said subblock list associated with said modified block.
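A representation of this kind could be sketched as a pair of tables, as below. The class `SubblockStore`, its method names, the toy `partition` function (a byte-sum constraint over the preceding three bytes), and the use of SHA-256 as the subblock identity are all hypothetical choices for illustration.

```python
import hashlib

def partition(block: bytes, window: int = 3, modulus: int = 4) -> list[bytes]:
    """Toy data-dependent partitioner: cut where the preceding bytes sum to 0 mod `modulus`."""
    cuts = [k for k in range(window, len(block))
            if sum(block[k - window:k]) % modulus == 0]
    edges = [0, *cuts, len(block)]
    return [block[i:j] for i, j in zip(edges, edges[1:])]

class SubblockStore:
    def __init__(self) -> None:
        self.subblocks: dict[bytes, bytes] = {}    # (i) hash -> subblock contents
        self.files: dict[str, list[bytes]] = {}    # (ii) filename -> list of hashes

    def write(self, name: str, data: bytes) -> None:
        hashes = []
        for s in partition(data):                  # step a
            h = hashlib.sha256(s).digest()
            self.subblocks.setdefault(h, s)        # add only if not already stored
            hashes.append(h)
        self.files[name] = hashes                  # update the subblock list
        live = {h for hs in self.files.values() for h in hs}
        for h in list(self.subblocks):             # remove unreferenced subblocks
            if h not in live:
                del self.subblocks[h]

    def read(self, name: str) -> bytes:
        return b"".join(self.subblocks[h] for h in self.files[name])

store = SubblockStore()
data = bytes(range(1, 21))
store.write("v1", data)
store.write("v2", b"\x63" + data)                  # near-identical second file
print(len(store.subblocks))                        # 6 unique subblocks serve both files
```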

13. In a further aspect of the invention, the invention provides a method for an entity E1 to communicate a block X to E2 where E1 possesses the knowledge that E2 possesses a group Y of subblocks Y₁ . . . Y_(m), comprising the following steps:

a. Partitioning X into subblocks X₁ . . . X_(n) in accordance with one of aspects 1 to 6;

b. Transmitting from E1 to E2 the contents of zero or more subblocks in X, and the remaining subblocks as references to subblocks in Y₁ . . . Y_(m) and to subblocks already transmitted.

Note: In most implementations of this aspect, the subblocks whose contents are transmitted will be those in X that are not in Y, and for which no identical subblock has previously been transmitted.

Note: To possess knowledge that E2 possesses Y₁ . . . Y_(m), E1 need not actually possess Y₁ . . . Y_(m) itself. E1 need only possess the identities of Y₁ . . . Y_(m) (e.g. the hashes of each subblock Y₁ . . . Y_(m)). This specification is intended to admit any other representation in which E1 may have the knowledge that E2 possesses (or has access to) Y₁ . . . Y_(m). In particular, the knowledge may take the form of a projection of Y.

Note: It is implicit in this aspect that E1 will be able to use comparison (or other methods) to use its knowledge of E2's possession of Y to determine the set of subblocks that are common to both X and Y. For example, if E1 possessed the hashes of the subblocks of Y, it could compare them to the hashes of the subblocks of X to determine the subblocks common to both X and Y. Subblocks that are not common can be transmitted explicitly. Subblocks that are common to both X and Y can be transmitted by transmitting a reference to the subblock.
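The sketch below illustrates step b for the case in which E1 holds only the hashes of Y₁ . . . Y_(m). It assumes X has already been partitioned, and the message format (tuples tagged `data`, `ref_y`, and `ref_sent`), the function names, and the use of SHA-256 are hypothetical.

```python
import hashlib

def H(subblock: bytes) -> bytes:
    return hashlib.sha256(subblock).digest()

def encode(x_subblocks: list[bytes], y_hashes: list[bytes]) -> list[tuple]:
    """E1 side: build the message from X and the hashes of Y (E1 need not hold Y itself)."""
    y_index = {h: i for i, h in enumerate(y_hashes)}
    sent: dict[bytes, int] = {}                    # hash -> position among transmitted contents
    message = []
    for s in x_subblocks:
        h = H(s)
        if h in y_index:
            message.append(("ref_y", y_index[h]))  # reference to a subblock in Y
        elif h in sent:
            message.append(("ref_sent", sent[h]))  # reference to a subblock already transmitted
        else:
            sent[h] = len(sent)
            message.append(("data", s))            # contents transmitted explicitly
    return message

def decode(message: list[tuple], y_subblocks: list[bytes]) -> bytes:
    """E2 side: rebuild X from the message and its own copy of Y."""
    received, out = [], []
    for kind, value in message:
        if kind == "data":
            received.append(value)
            out.append(value)
        elif kind == "ref_y":
            out.append(y_subblocks[value])
        else:
            out.append(received[value])
    return b"".join(out)

Y = [b"aa", b"bb"]
X = [b"aa", b"cc", b"cc", b"bb"]
msg = encode(X, [H(s) for s in Y])
print(msg)             # [('ref_y', 0), ('data', b'cc'), ('ref_sent', 0), ('ref_y', 1)]
print(decode(msg, Y))  # b'aaccccbb'
```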

14. In a further aspect of the invention, the invention provides a method for an entity E1 to communicate one or more subblocks of a group X of subblocks X₁ . . . X_(n) to E2 where E1 possesses the knowledge that E2 possesses the block Y, comprising the following steps:

a. Partitioning Y into subblocks Y₁ . . . Y_(m) in accordance with one of aspects 1 to 6;

b. Transmitting from E1 to E2 the contents of zero or more subblocks in X, and the remaining subblocks as references to subblocks in Y and to subblocks already transmitted.

15. In a further aspect of the invention, the invention provides a method for an entity E1 to communicate a block X to E2 where E1 possesses the knowledge that E2 possesses block Y, comprising the following steps:

a. Partitioning in accordance with one of aspects 1 to 6, X into subblocks X₁ . . . X_(n) and Y into subblocks Y₁ . . . Y_(m);

b. Transmitting from E1 to E2 the contents of subblocks in X, and the remaining subblocks as references to subblocks in Y and to subblocks already transmitted.

16. In a further aspect of the invention, the invention provides a method for constructing a block D from a block X and a group Y of subblocks Y₁ . . . Y_(m) such that X can be constructed from Y and D, comprising the following steps:

a. Partitioning X into subblocks X₁ . . . X_(n) in accordance with one of aspects 1 to 6;

b. Constructing D from one or more of the following: the contents of zero or more subblocks in X, references to zero or more subblocks in Y, and references to zero or more subblocks in D.

Note: Step b above is intended to encompass the case where a mixture of the elements it describes is used.

17. In a further aspect of the invention, the invention provides a method for constructing a block D from a group X of subblocks X₁ . . . X_(n) and a block Y such that X can be constructed from Y and D, comprising the following steps:

a. Partitioning Y into subblocks Y₁ . . . Y_(m) in accordance with one of aspects 1 to 6;

b. Constructing D from one or more of the following: the contents of zero or more subblocks in X, references to zero or more subblocks in Y, and references to zero or more subblocks in D.

18. In a further aspect of the invention, the invention provides a method for constructing a block D from a block X and a block Y such that X can be constructed from Y and D, comprising the following steps:

a. Partitioning in accordance with one of aspects 1 to 6, X into subblocks X₁ . . . X_(n) and Y into subblocks Y₁ . . . Y_(m);

b. Constructing D from one or more of the following: the contents of zero or more subblocks in X, references to zero or more subblocks in Y, and references to zero or more subblocks in D.

19. In a further aspect of the invention, the invention provides a method for constructing a block D from a block X and a projection of Y, said projection comprising an ordered or unordered collection of elements wherein each element consists of a subblock in Y, an identity of a subblock in Y, or a reference of a subblock in Y, such that X can be constructed from Y and D, comprising the following steps:

a. Partitioning X into subblocks X₁ . . . X_(n) in accordance with one of aspects 1 to 6;

b. Constructing D from one or more of the following: the contents of zero or more subblocks in X, references to zero or more subblocks in Y, and references to zero or more subblocks in D.

20. In a further aspect of the invention, the invention provides a method for constructing a block X from a block Y and a block D, comprising the following steps:

a. Partitioning Y into subblocks Y₁ . . . Y_(m) in accordance with one of aspects 1 to 6;

b. Constructing X from D and Y by constructing the subblocks of X based on one or more of:

(i) subblocks contained within D;

(ii) references in D to subblocks in Y;

(iii) references in D to subblocks in D.
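The construction of D from X and Y, and the reconstruction of X from Y and D, can be sketched together. In this sketch D is held as a Python list rather than a serialized block, and the entry tags (`raw`, `Y`, `D`), the function names, and the toy `partition` function are hypothetical; a practical implementation would more likely compare hashes of subblocks than the subblocks themselves.

```python
def partition(block: bytes, window: int = 3, modulus: int = 4) -> list[bytes]:
    """Toy data-dependent partitioner: cut where the preceding bytes sum to 0 mod `modulus`."""
    cuts = [k for k in range(window, len(block))
            if sum(block[k - window:k]) % modulus == 0]
    edges = [0, *cuts, len(block)]
    return [block[i:j] for i, j in zip(edges, edges[1:])]

def make_delta(x: bytes, y: bytes) -> list[tuple]:
    """Build D from X and Y."""
    y_index: dict[bytes, int] = {}
    for i, s in enumerate(partition(y)):
        y_index.setdefault(s, i)
    d: list[tuple] = []
    d_index: dict[bytes, int] = {}
    for s in partition(x):
        if s in y_index:
            d.append(("Y", y_index[s]))       # reference to a subblock in Y
        elif s in d_index:
            d.append(("D", d_index[s]))       # reference to a subblock in D
        else:
            d_index[s] = len(d)
            d.append(("raw", s))              # subblock contained within D
    return d

def apply_delta(d: list[tuple], y: bytes) -> bytes:
    """Rebuild X from D and Y."""
    y_subblocks = partition(y)
    out = []
    for kind, value in d:
        if kind == "raw":
            out.append(value)
        elif kind == "Y":
            out.append(y_subblocks[value])
        else:
            out.append(d[value][1])
    return b"".join(out)

y = bytes(range(1, 21))
x = b"\x63" + y
delta = make_delta(x, y)
print([kind for kind, _ in delta])            # ['raw', 'Y', 'Y', 'Y', 'Y']
print(apply_delta(delta, y) == x)             # True
```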

21. In a further aspect of the invention, the invention provides a method for constructing a group X of subblocks X₁ . . . X_(n) from a block Y and a block D, comprising the following steps:

a. Partitioning Y into subblocks Y₁ . . . Y_(m) in accordance with one of aspects 1 to 6;

b. Constructing X₁ . . . X_(n) from D and Y based on one or more of:

(i) subblocks contained within D;

(ii) references in D to subblocks in Y;

(iii) references in D to subblocks in D.

22. In a further aspect of the invention, the invention provides a method for communicating a data block X from one entity E1 to another entity E2 comprising the following steps:

a. Partitioning X into subblocks X₁ . . . X_(n) in accordance with one of aspects 1 to 6;

b. Transmitting from E1 to E2 an identity of one or more subblocks;

c. Transmitting from E2 to E1 information communicating the presence or absence of subblocks at E2;

d. Transmitting from E1 to E2 at least the subblocks identified in step (c) as not being present at E2.

Note: The information communicated in step (c) could take the form of a bitmap (or a compressed bitmap) corresponding to the subblocks referred to in step (a). It could also take many other forms.

Note: If a group of subblocks are to be transmitted, the above steps could be performed completely for each subblock before moving onto the next subblock. The steps could be applied to any subgroup of subblocks.
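The exchange of steps b to d might look as follows when both entities are modelled in one process. The class `Receiver`, the function `send`, the list-of-booleans bitmap, and the use of SHA-256 as the subblock identity are assumptions made for this sketch; X is taken to be already partitioned.

```python
import hashlib

def H(subblock: bytes) -> bytes:
    return hashlib.sha256(subblock).digest()

class Receiver:
    """Plays the role of E2."""
    def __init__(self, subblocks: list[bytes]) -> None:
        self.have = {H(s): s for s in subblocks}

    def presence(self, hashes: list[bytes]) -> list[bool]:
        # Step c: report presence or absence as a bitmap, one flag per identity.
        return [h in self.have for h in hashes]

    def accept(self, hashes: list[bytes], missing: list[bytes]) -> bytes:
        self.have.update((H(s), s) for s in missing)
        return b"".join(self.have[h] for h in hashes)

def send(x_subblocks: list[bytes], e2: Receiver) -> tuple[bytes, int]:
    """Plays the role of E1; returns the block as rebuilt at E2 and the number of subblocks sent."""
    hashes = [H(s) for s in x_subblocks]                 # step b: transmit identities
    bitmap = e2.presence(hashes)                         # step c
    missing = [s for s, present in zip(x_subblocks, bitmap) if not present]
    return e2.accept(hashes, missing), len(missing)      # step d: transmit absent subblocks

e2 = Receiver([b"aa", b"bb"])
rebuilt, transmitted = send([b"aa", b"cc", b"bb", b"dd"], e2)
print(rebuilt, transmitted)    # b'aaccbbdd' 2
```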

23. In a further aspect of the invention, the invention provides a method for communicating a block X from one entity E1 to another entity E2, comprising the following steps:

a. Partitioning X into subblocks X₁ . . . X_(n) in accordance with one of aspects 1 to 6;

b. Transmitting from E2 to E1 information communicating the presence or absence at E2 of members of a group Y of subblocks Y₁ . . . Y_(m);

c. Transmitting from E1 to E2 the contents of zero or more subblocks in X, and the remaining subblocks as references to subblocks in Y₁ . . . Y_(m) and to subblocks already transmitted.

24. In a further aspect of the invention, the invention provides a method for an entity E2 to communicate to an entity E1 the fact that E2 possesses a block Y, comprising the following steps:

a. Partitioning Y into subblocks Y₁ . . . Y_(m) in accordance with one of aspects 1 to 6;

b. Transmitting from E2 to E1 references of the subblocks Y₁ . . . Y_(m).

25. In a further aspect of the invention, the invention provides a method for an entity E1 to communicate a subblock X_(i) to an entity E2, comprising the following steps:

a. Partitioning X into subblocks X₁ . . . X_(n) in accordance with one of aspects 1 to 6;

b. Transmitting from E2 to E1 an identity of X_(i);

c. Transmitting X_(i) from E1 to E2.

Note: This aspect applies (among other applications) to the case of a network server E1 that serves subblocks to clients such as E2, given the identities (e.g. hashes) of the requested subblocks.

26. In a further aspect of the invention, the invention provides a method according to one of aspects 1 to 6, wherein said subblocks are compared by comparing the hashes of said subblocks.

27. In a further aspect of the invention, the invention provides a method according to one of aspects 1 to 6, wherein subsets of identical subblocks within a group of one or more subblocks are found, by inserting each subblock, an identity of each subblock, a reference of each subblock, or a hash of each subblock, into a data structure.
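As one example of such a data structure, the sketch below inserts the hash of each subblock into a hash table (a Python dictionary) and reads off the groups of identical subblocks. The function name `duplicate_groups` and the choice of SHA-256 are assumptions of the example.

```python
import hashlib
from collections import defaultdict

def duplicate_groups(subblocks: list[bytes]) -> list[list[int]]:
    """Return, for each subblock value that occurs more than once, the indices at which it occurs."""
    table: dict[bytes, list[int]] = defaultdict(list)
    for i, s in enumerate(subblocks):
        table[hashlib.sha256(s).digest()].append(i)
    return [indices for indices in table.values() if len(indices) > 1]

print(duplicate_groups([b"a", b"b", b"a", b"c", b"b", b"a"]))   # [[0, 2, 5], [1, 4]]
```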

28. In a further aspect of the invention, the invention provides an apparatus for organizing a block b of digital data for the purpose of storage, communication, or comparison, by partitioning said block into subblocks at one or more positions k|k+1 within said block for which b[k-A+1 . . . k+B] satisfies a predetermined constraint, where A and B are natural numbers.

Note: The specification of this aspect encompasses the degenerate case in which either A or B is zero. The specification also includes the case where the constraint does not pay attention to some parts of b[k-A+1 . . . k+B]. For example, a constraint that pays attention only to (say) b[k-3] and b[k+2] would fall under the classes of constraint corresponding to A≧4 and B≧2.

29. In a further aspect of the invention, the invention provides an apparatus according to aspect 28 in which the constraint comprises the hash of some or all of b[k-A+1 . . . k+B].

30. In a further aspect of the invention, the invention provides an apparatus according to aspect 28, for locating the nearest subblock boundary on a side of a position p|p+1 within a said block, comprising the step of:

a. Evaluating whether said predetermined constraint is satisfied at each position k|k+1, for increasing (or decreasing) k, where k starts with the value p.

BRIEF DESCRIPTION OF FIGURES

FIG. 1 shows how data can become "misaligned" relative to its containing blocks when data is inserted.

FIG. 2 shows how data can be divided into fixed-width subblocks or variable-width subblocks.

FIG. 3 shows how data-dependent partitions move with the data when the data is shifted (e.g. by an insertion).

FIG. 4 depicts the data-dependent partitioning of a block b of data into subblocks using a constraint F.

FIG. 5 depicts the search within a block b for a subblock boundary using a constraint F.

FIG. 6 shows how a block may be subdivided in different ways using different partitioning constraints.

FIG. 7 shows how "higher order" subblocks can be constructed from one or more initial subblocks.

FIG. 8 shows how different partitioning functions can produce subblocks of differing average sizes.

FIG. 9 shows how subblocks can be organized into a hierarchy. Such a hierarchy can be constructed by progressively restricting a constraint F.

FIG. 10 depicts a method (and apparatus) for the partitioning of a block b into subblocks using a constraint F, and the calculation of the hashes of the subblocks using hash function H.

FIG. 11 depicts the partitioning of a block b into subblocks using a constraint F, and the projection of those subblocks into a structure consisting of subblock hashes, subblock data, and subblock references.

FIG. 12 depicts a method (and apparatus) for partitioning two blocks b1 and b2 into subblocks, using a constraint F, and then comparing the subblocks.

FIG. 13 depicts a method (and apparatus) for the partitioning, using a constraint F, of two blocks b1 and b2 into subblocks, the calculation of the hashes of the subblocks using H, and the comparison of those hashes with each other to determine (among other things) subblocks common to both b1 and b2.

FIG. 14 depicts a method (and apparatus) for a file system that employs an aspect of the invention to eliminate the multiple storage of subblocks common to more than one file (or to different parts of the same file).

FIG. 15 depicts a method (and apparatus) for the communication of a block X from E1 to E2 where both E1 and E2 possess Y.

FIG. 16 depicts a method (and apparatus) for the construction of a block D from which X may be later reconstructed, given Y.

FIG. 17 depicts a method (and apparatus) for the construction of a block D from which X may be later reconstructed, given Y. In this case, the entity constructing D does not have access to Y, only to a projection of Y (being perhaps the hashes of the subblocks of Y).

FIG. 18 depicts a method (and apparatus) for the reconstruction of X from the blocks Y and D.

FIG. 19 depicts a method (and apparatus (E1 and E2 at each time)) for the communication of a block X from entity E1 to entity E2 where E2 already possesses Y.

FIG. 20 depicts a method (and apparatus (E1 and E2 at each time)) for the communication of a block X from entity E1 to entity E2 where E2 already possesses Y and where E2 first discloses to E1 information about Y.

FIG. 21 depicts a method (and apparatus) for the communication, from entity E2 to entity E1, of information about a block (or group of subblocks) Y at E2.

FIG. 22 depicts a method (and apparatus (E1 and E2 at each time)) for the communication from entity E1 to entity E2 of subblock X_(i) following a request by entity E2 for the subblock X_(i).

FIG. 23 depicts an apparatus for partitioning a block b (the input) using a constraint F. The output is a set of subblock boundary positions.

FIG. 24 depicts a method (and apparatus) for the partitioning of a block b into subblocks using constraint F, and the projection of those subblocks into a table of subblock hashes.

FIG. 25 depicts a method (and apparatus) for the transmission from entity E1 to E2 of a block X where E2 possesses Y and E1 possesses a table of the hashes of the subblocks of Y (a projection of Y).

FIG. 26 depicts a method (and apparatus) for a file system that employs an aspect of the invention to eliminate the multiple storage of data common to more than one file (or to different parts of the same file).

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

This section contains a detailed discussion of mechanisms that could be used to implement aspects of the invention. It also contains examples of implementations of selected aspects of the invention. However, nothing in this section should be interpreted as a limitation on the scope of this patent.

Units of Information

Aspects of this invention can be applied at various levels of granularity of data. For example, if the data was treated as a stream of bits, boundaries could be placed between any two bits. However, if the data was treated as a stream of bytes, boundaries would usually be positioned only between bytes. The invention could be applied with any unit of data, and in this document references to bits and bytes should usually be interpreted as admitting any granularity.

The Concept of Entity

At various places, this patent specification uses the term "entity" to describe an agent. This term is purposefully vague and is intended to cover all forms of agent including, but not limited to:

Computer systems.

Networks of computer systems.

Processes in computer systems.

File systems.

Components of software.

Dedicated computer systems.

Communications systems.

The Concepts of Identity and Reference

This patent specification frequently refers to "identities" of subblocks and "references" to subblocks. These terms are not intended to be defined precisely.

The identity of a subblock means any piece of information that could be used in place of the subblock for the purpose of comparison for identicality. Identities include, but are not limited to:

The subblock itself.

A hash of the subblock.

The subblock acts as its own identity because subblocks themselves can be compared with each other. Hashes of subblocks also act as identities of subblocks because hashes of subblocks can be compared with each other to determine if their corresponding subblocks are identical.

A reference to a subblock means any piece of information that could be used in practice by one entity to identify to another entity (or itself) a particularly valued subblock, where the two entities may already share some knowledge. For example, the two entities might each possess the knowledge that the other entity already possesses ten subblocks of known values having particular index values numbered one to ten.

Once two entities have a basis of shared knowledge, it is possible for them to identify a subblock in ways more concise than the transmission of an identity. A reference to a particularly valued subblock can take (without limitation) the following forms:

An identity.

An identifying number of a subblock possessed by the receiver.

An identifying number of a subblock previously transmitted between the two communicants.

The location of the subblock in some shared data space.

A relative subblock number.

Ranges of the above.

The concept of knowledge of a subblock is related to the concepts of identity and reference. An entity may have knowledge of a subblock (or knowledge that another entity possesses a subblock) without actually possessing the subblock itself. For example, it might possess an identity of the subblock or a reference to the subblock.

The Use of Ranges

In any situation where a group of contiguous values (e.g. 6, 7, 8, 9) is to be communicated or stored, such a group can be represented using a range (e.g. 6-9) which may take up less communication time or storage space. Ranges can be applied to all kinds of things, such as index values and subblock numbers. In particular, if an entity notices that the references (to subblocks) that it is about to transmit are contiguous, it can replace the references with a range.

Ranges can be represented in any way that identifies the first and last element of the range. Three common representations are:

The first and last element of the range.

The first element and the length of the range.

The last element and the length of the range.

The concept of range can be generalized to include the compression of any group of values that exhibit compressible structure.
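The replacement of contiguous references by ranges can be sketched as follows, using the first of the three representations above (first and last element). The function name `to_ranges` is hypothetical.

```python
def to_ranges(values: list[int]) -> list[tuple[int, int]]:
    """Collapse runs of consecutive integers into (first, last) pairs."""
    ranges: list[tuple[int, int]] = []
    for v in values:
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], v)    # extend the current range
        else:
            ranges.append((v, v))              # start a new range
    return ranges

print(to_ranges([6, 7, 8, 9, 12, 14, 15]))     # [(6, 9), (12, 12), (14, 15)]
```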

The Use of Backward References

References can be used not only to refer to data shared by twocommunicants at the start of a transmission, but can also be used torefer to data communicated at some previous time during thetransmission.

For example, if an entity A notices that the subblock it is about totransmit to another entity B was not possessed by B at the start of thetransmission, but has since been transmitted from A to B, then A couldcode the second instance of the subblock as a reference to the previousinstance of the subblock. The range mechanism can be used here too.

No Requirement for Subblock Framing Information

It is possible that an entity E1 could transmit a group X of subblocksX₁ . . . X_(n) as a group to an entity E2 simply by sending theconcatenation of the subblocks. There may be no need for any framinginformation (e.g. information at the start of each subblock giving thelength of the subblock or "escape" codes to indicate subblockboundaries), as E2 is capable of partitioning X into X₁ . . . X_(n)itself.

No Requirement for Ordering Subblocks

If two entities E1 and E2 both possess the same unordered group Y of subblocks (or knowledge of such a group of subblocks) then even though E1 and E2 may not possess the subblocks in the same order, the subblocks can still be referred to using a subblock index or serial number. This is achieved by having E1 and E2 each sort their subblocks in accordance with some mutually agreed (or universally defined) ordering method and then number the subblocks in the resultant ordered group of subblocks. These numbers (or ranges of such numbers) can then be used to refer to the subblocks.
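As a minimal sketch of this numbering scheme, assume that the mutually agreed ordering is a plain bytewise sort of the subblock contents. That choice, and the name number_subblocks, are assumptions made for illustration.

```python
def number_subblocks(subblocks):
    """Assign each distinct subblock its index in the agreed (bytewise) ordering."""
    return {sb: i for i, sb in enumerate(sorted(set(subblocks)))}
```

Two entities that hold the same group in different orders obtain the same numbering, so a number sent by one can be resolved by the other.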

An Overview of Hash Functions

Although the use of a hash function is not essential in all aspects of this invention, hash functions provide such advantages in the implementation of this invention that an overview of them is warranted.

A hash function accepts a variable-length input block of bits and generates an output block of bits that is based on the input block. Most hash functions guarantee that the output block will be of a particular length (e.g. 16 bits) and aspire to provide a random, but deterministic, mapping between the infinite set of input blocks and the finite set of output blocks. The property of randomness enables these outputs, called "hashes", to act as easily manipulated representatives of the original block.

Hash functions come in at least four classes of strength.

Narrow hash functions: Narrow hash functions are the weakest class of hash functions and generate output values that are so narrow (e.g. 16 bits) that the entire space of output values can be searched in a reasonable amount of time. For example, an 8-bit hash function would map any data block to a hash in the range 0 to 255. A 16-bit hash function would map to a hash in the range 0 to 65535. Given a particular hash value, it would be possible to find a corresponding block simply by generating random blocks and feeding them into the narrow hash function until the searched-for value appeared. Narrow hash functions are usually used to arbitrarily (but deterministically) classify a set of data values into a small number of groups. As such, they are useful for constructing hash table data structures, and for detecting errors in data transmitted over noisy communication channels. Examples of this class: CRC-16, CRC-32, Fletcher checksum, the IP checksum.

Wide hash functions: Wide hash functions are similar to narrow hash functions except that their output values are significantly wider. At a certain point this quantitative difference implies a qualitative difference. In a wide hash function, the output value is so wide (e.g. 128 bits) that the probability of any two randomly chosen blocks having the same hashed value is negligible (e.g. about one in 10³⁸). This property enables these wide hashes to be used as "identities" of the blocks of data from which they are calculated. For example, if entity E1 has a block of data and sends the wide hash of the block to an entity E2, then if entity E2 has a block that has the same hash, then the a-priori probability of the blocks actually being different is negligible. The only catch is that wide hash functions are not designed to be non-invertible. Thus, while the space of (say) 2¹²⁸ values is too large to search in the manner described for narrow hash functions, it may be easy to analyse the hash function and calculate a block corresponding to a particular hash. Accordingly, E1 could fool E2 into thinking E1 had one block when it really had a different block. Examples of this class: any 128-bit CRC algorithm.

Weak one-way hash functions: Weak one-way hash functions are not only wide enough to provide "identity", but they also provide cryptographic assurance that it will be extremely difficult, given a particular hash value, to find a block corresponding to that hash value. Examples of this class: a 64-bit DES hash.

Strong one-way hash functions: Strong one-way hash functions are the same as weak one-way hash functions except that they have the additional property of providing cryptographic assurance that it is difficult to find any two different blocks that have the same hash value, where the hash value is unspecified. Examples of this class: MD4, MD5, and SHA-1.

These four classes of hash provide a range of hashing strengths from which to choose. As might be expected, the speed of a hash function decreases with strength, providing a tradeoff, and different strengths are appropriate in different applications. However, the difference is small enough to admit the use of strong one-way hash functions in all but the most time-critical applications.

The term cryptographic hash is often used to refer to hashes that provide cryptographic strength, encompassing both the class of weak one-way hash functions and the class of strong one-way hash functions. However, as strong one-way hash functions are almost always preferable to weak one-way hash functions, the term "cryptographic hash" is used mainly to refer to the class of strong one-way hash functions.

The present invention can employ hash functions in at least two roles:

1. To determine subblock boundaries.

2. To generate subblock identities.

Depending on the application, hash functions from any of the four classes above could be employed in either role. However, as the determination of subblock boundaries does not require identity or cryptographic strength, it would be inefficient to use hash functions from any but the weakest class. Similarly, the need for identity, the ever-present threat of subversion, and the minor performance penalty for strong one-way hash functions (compared to weak ones) suggest that nothing less than strong one-way hash functions should be used to calculate subblock identities.

The security dangers inherent in employing anything less than a strong one-way hash function to generate identities can be illustrated by considering a communications system or file system that incorporates the invention using any such weaker hash function. In such a system, an intruder could modify a subblock (to be manipulated by a target system) in such a way that the modified subblock has the same hash as another subblock known by the intruder to be already present in the target system. This could result in the target system retaining its existing subblock rather than replacing it by a new one. Such a weakness could be used (for example) to prevent a target system from properly applying a security patch retrieved over a network.

Thus, while wide hash functions could be safely used to calculate subblock identities in systems not exposed to hostile humans, even weak one-way hash functions are likely to be insecure in those systems that are.

We now turn to the ways in which hashes of blocks or subblocks can actually be used.

The Use of Cryptographic Hashes

The theoretical properties of cryptographic hashes (and here is meant strong one-way hash functions) yield particularly interesting practical properties. Because such hashes are significantly wide, the probability of two randomly-chosen subblocks having the same hash is practically zero (for a 128-bit hash, it is about one in 10³⁸), and because it is computationally infeasible to find two subblocks having the same hash, it is practically guaranteed that no intelligent agent will be able to do so. The implication of these properties is that from a practical perspective, the finite set of hash values for a particular cryptographic hash algorithm is one-to-one with the infinite set of finite variable length subblocks. This theoretically impossible property manifests itself in practice because of the practical infeasibility of finding two subblocks that hash to the same value.

This property means that, for the purposes of comparison (for identity), cryptographic hashes may safely be used in place of the subblocks from which they were calculated. As most cryptographic hashes are only about 128 bits long, hashes provide an extremely efficient way to compare subblocks without requiring the direct comparison of the content of the subblocks themselves. Such comparisons can be used to eliminate many transmissions of information. For example, a subblock X₁ on a computer C1 in Sydney could be compared with a subblock Y₁ on a computer C2 in Boston by a computer C3 in Paris, with the total theoretical network traffic being just 256 bits (C1 and C2 each send the 128-bit hash of their respective subblocks to C3 for comparison, and C3 compares the two hashes).

Some of the ways in which cryptographic hashes could be used in aspects of this invention are:

Cryptographic hashes can be used to compare two subblocks without having to compare, or requiring access to, the content of the subblocks.

If it is necessary to be able to determine whether a subblock T is identical to one of a group of subblocks, the subblocks themselves need not be stored, just a collection of their hashes. The hash of any candidate subblock can then be compared with the hashes in the collection to establish whether the subblock is in the group of subblocks from which the collection of hashes was generated.

Cryptographic hashes can be used to ensure that the partitioning of a block into subblocks and the subsequent reassembly of the subblocks into a reconstructed block is error-free. This can be done by comparing the hash of the original block with the hash of the reconstructed block.

If an entity E1 calculates the hash of a subblock X₁ and transmits it to E2, then if E2 possesses X₁, or even just the hash of X₁, then E2 can determine without any practical doubt that E1 possesses X₁.

If an entity E1 passes a key (consisting of a block of bits) chosen at random to an entity E2, E2 may then prove to E1 that it possesses a subblock by sending E1 the hash of the concatenation of the key and the subblock. This mechanism could be used as an additional check in security applications.

If a group of subblocks must be compared so as to find all subsets of identical subblocks, the corresponding set of hashes of the subblocks may be calculated and compared instead.

Many of the uses of cryptographic hashes for subblocks can also be applied to blocks. For example, cryptographic hashes can be used to determine whether a block has changed at all since it was last backed up. Such a check could eliminate the need for further analysis.
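Two of the uses listed above, membership testing against a collection of hashes and proof of possession using a random key, can be sketched as follows. SHA-256 from the Python standard library stands in for the strong one-way hash. That choice, and the names digest, is_known and prove_possession, are assumptions of this sketch rather than part of the described method.

```python
import hashlib
import os


def digest(data: bytes) -> bytes:
    """Strong one-way hash used as the identity of a subblock (SHA-256 assumed)."""
    return hashlib.sha256(data).digest()


# Only the hashes are kept; the subblocks themselves need not be stored.
known_hashes = {digest(sb) for sb in (b"alpha", b"beta")}


def is_known(subblock: bytes) -> bool:
    """Check membership in the group by comparing hashes only."""
    return digest(subblock) in known_hashes


def prove_possession(key: bytes, subblock: bytes) -> bytes:
    """E2's reply to E1's random key: the hash of key followed by subblock."""
    return digest(key + subblock)


# E1 picks a fresh random key for each challenge.
challenge_key = os.urandom(16)
```

E1 accepts the proof only if the reply equals its own computation of the hash over the same key and its own copy of the subblock.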

Use of Hashes as a Safety Net

A potential disadvantage of deploying aspects of this invention is that it will add extra complexity to the systems into which it is incorporated. This increased complexity carries the potential to increase the chance of undetected failures.

The main mechanism of complexity introduced by many aspects of the invention is the partitioning of blocks (e.g. files) into subblocks, and the subsequent re-assembly of such subblocks. By partitioning a block into subblocks, a system creates the potential for subblocks to be erroneously added, deleted, rearranged, substituted, duplicated, or in some other way exposed to a greater risk of accidental error.

This risk can be reduced or eliminated by calculating the hash (preferably a cryptographic hash) of the block before it is partitioned into subblocks, storing the hash with an entity associated with the block as a whole, and then later comparing the stored hash with a computed hash of the reconstructed block. Such a check would provide a very strong safety net that would virtually eliminate the risk of undetected errors arising from the use of this invention.
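The safety net can be sketched as a check applied at reassembly time. The fixed-size splitter, the use of SHA-256, and the names split_fixed and checked_reassemble are assumptions made only to keep the illustration self-contained.

```python
import hashlib


def split_fixed(block: bytes, size: int):
    """Stand-in partitioning: cut the block every `size` bytes."""
    return [block[i:i + size] for i in range(0, len(block), size)]


def checked_reassemble(subblocks, stored_digest: bytes) -> bytes:
    """Reassemble the block and compare its hash with the hash stored beforehand."""
    block = b"".join(subblocks)
    if hashlib.sha256(block).digest() != stored_digest:
        raise ValueError("reconstructed block does not match the stored hash")
    return block
```

Any added, dropped, reordered or substituted subblock changes the hash of the reconstructed block, so the error is reported instead of passing silently.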

Choosing a Partitioning Constraint Function

Although the requirements for the block partitioning constraint (e.g. in the form of a constraint function F) are not stringent, care should be taken to select a function that suits the application to which it is to be applied.

In situations where the data is highly structured and knowledge of the data is available, a choice of an F that tends to place subblock boundaries at positions in the data that correspond to obvious boundaries in the data could be advantageous. However, in general, F should be chosen from the class of narrow hash functions. Use of a narrow hash function for F provides both efficiency and a (deterministic) randomness that will enable the implementation to operate effectively over a wide range of data.

One of the most important properties of F is the probability that F will place a boundary at any particular point when applied to completely random data. For example, a function with a probability of one would produce a boundary between each bit (or byte), whereas a function with a probability of zero would never produce any boundaries at all. In a real application, a more moderate probability would be chosen (e.g. 1/1024) so as to yield useful subblock sizes. The probability can be tuned to suit the application.

We end this section with an example of a narrow hash function that has been implemented and tested and seems to perform well on a variety of data types. The hash function calculates a hash value from three bytes.

    H(b₁, b₂, b₃) = ((40543 × ((b₁ << 8) ⊕ (b₂ << 4) ⊕ b₃)) >> 4) | ρ

The following notation has been used. "×" is multiplication. "<<" is left bit shift. ">>" is right bit shift. "⊕" is exclusive or. "|" is modulo. The constant ρ is the inverse of the probability of placing a boundary at an arbitrary position in a randomly generated block of data, and can be set to any integer value in [0,65535]. However, in practice it seems to be advantageous to choose values that are prime (Mersenne primes seem to work well). The value 40543 was chosen carefully in accordance with the hash function design guidelines provided in pages 508-613 of the book:

Knuth D. E., "The Art of Computer Programming: Volume 3: Sorting and Searching", Addison Wesley, 1973.

The function generates a value in the range [0, ρ-1] and can be used in practice by placing a boundary at each point where the preceding three bytes hash to a predetermined constant value V. This would imply that its arguments b₁ . . . b₃ correspond to the argument A in aspect one above. To avoid pathological behaviour in the commonly occurring case of runs of zeros, it is wise to choose a non-zero value for V.

In a real implementation, ρ was set to 511 and V was set to one.
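The example hash function and its use for boundary placement can be sketched as below, with ρ = 511 and V = 1 as in the implementation just mentioned. The sketch assumes that the three terms inside the multiplication are combined with exclusive or. The names boundaries and split_at, and the convention of reporting a boundary as the index of the byte that follows it, are illustrative assumptions.

```python
def H(b1: int, b2: int, b3: int, p: int = 511) -> int:
    """Narrow hash of three bytes, reduced modulo p (the constant called rho in the text)."""
    return ((40543 * ((b1 << 8) ^ (b2 << 4) ^ b3)) >> 4) % p


def boundaries(block: bytes, p: int = 511, v: int = 1):
    """Indices i such that a boundary lies between block[i-1] and block[i]."""
    return [i for i in range(3, len(block))
            if H(block[i - 3], block[i - 2], block[i - 1], p) == v]


def split_at(block: bytes, cuts):
    """Cut the block into subblocks at the given boundary indices."""
    edges = [0, *cuts, len(block)]
    return [block[a:b] for a, b in zip(edges, edges[1:])]
```

Three zero bytes hash to zero, so with a non-zero V a run of zeros produces no boundaries at all, which is the behaviour the text recommends.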

Placing an Upper and Lower Bound on the Subblock Size

The use of data-dependent subblock boundaries provides a way to deterministically partition similar portions of data in a context-independent way. However, if artificial bounds are not placed on the subblock size, particular kinds of data will yield subblocks that are too large or too small to be effective. For example, if a file contains a block of a million identical bytes, any deterministic constraint (that operates at the byte level) must either partition the block into one subblock or a million subblocks. Both alternatives are undesirable.

A solution to this problem is artificially to impose an upper bound U and a lower bound L on the subblock size. There seem to be a limitless number of ways of doing this. Here are some examples:

Upper bound: Subdivide subblocks that are longer than U bytes at the points U, 2U, 3U, and so on, where U is the chosen upper bound on subblock size.

Upper bound: Subdivide subblocks that are longer than U bytes at points determined by a secondary hash function.

Lower bound: Of the set of boundaries that bound subblocks less than L bytes long, remove those boundaries that are closer to their neighbouring boundaries than their neighbouring boundaries are to their neighbouring boundaries.

Lower bound: If the block is being scanned sequentially, do not place a boundary unless at least L bytes have been scanned since the previous boundary.

Lower bound: Of the set of boundaries that bound subblocks less than L bytes long, remove those boundaries that satisfy some secondary hash function.

Lower bound: Of the set of boundaries that bound subblocks less than L bytes long, remove randomly chosen boundaries until all the resulting subblocks are at least L bytes long.

Many other such schemes could be devised.
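One combination of the schemes above is sketched here: the sequential lower bound (no boundary until L bytes have been scanned) together with an upper bound that forces a boundary once U bytes have accumulated. The constraint is passed in as a hypothetical predicate is_boundary(block, i), which returns True when a boundary is wanted just before byte i. All names are assumptions of this sketch.

```python
def partition_bounded(block: bytes, is_boundary, lower: int, upper: int):
    """Sequentially partition `block`, keeping subblock sizes within [lower, upper].

    The final subblock may be shorter than `lower`, since the block simply ends.
    """
    if not block:
        return []
    subblocks, start = [], 0
    for i in range(1, len(block)):
        length = i - start
        if length < lower:
            continue  # lower bound: too soon after the previous boundary
        if length >= upper or is_boundary(block, i):
            subblocks.append(block[start:i])
            start = i
    subblocks.append(block[start:])
    return subblocks
```

With this arrangement a block of a million identical bytes is cut into pieces of exactly U bytes when the constraint never fires, and into pieces of exactly L bytes when it always fires.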

The Use of Multiple Partitionings

In most applications the use of just one partitioning into subblocks will be sufficient. However, in some applications there may be a need for more than one subblock partitioning. For example, in applications where channel space is expensive, it may be appropriate to partition each block of data in W different ways, using W different constraint functions F₁ . . . F_(W) where each function provides a different average subblock size. For example, four different partitions could be performed using functions that provide subblocks of average length 256 bytes, 1K, 10K, and 100K. By providing a range of different sizes of subblocks to choose from, such an organization could simultaneously indicate large blocks extremely efficiently, while still retaining fine-grained subblocks so that minor changes to the data do not result in voluminous updates (FIG. 8).

The efficiency of such a scheme could be improved by performing the partitioning all in one operation using increasing constraints on a single F. For example, one could use the example hash function described earlier, but use different values of the constant ρ to determine the different levels of subdivision. By choosing appropriately related values of ρ, the set of boundaries that could be produced by the different F could be arranged to be subsets of each other, resulting in a tree structure of subblocks. For example, values of ρ of 32, 64, 128, and 256 could be used. FIG. 9 shows how the subblocks of four levels of the tree could relate to each other.

A further method could define the hash of a larger block to be the hash of the hashes of its component blocks.
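The nesting of boundary sets for ρ = 32, 64, 128 and 256 can be checked with a short sketch. It assumes the example hash function with exclusive or between its terms and a common target value V = 1. Because each modulus divides the next, any position whose hash is congruent to 1 modulo 256 is also congruent to 1 modulo 128, 64 and 32. The names boundary_set and nested_levels are hypothetical.

```python
def H(b1: int, b2: int, b3: int, p: int) -> int:
    """Example narrow hash of three bytes, reduced modulo p."""
    return ((40543 * ((b1 << 8) ^ (b2 << 4) ^ b3)) >> 4) % p


def boundary_set(block: bytes, p: int, v: int = 1):
    """Set of indices i with a boundary between block[i-1] and block[i]."""
    return {i for i in range(3, len(block))
            if H(block[i - 3], block[i - 2], block[i - 1], p) == v}


def nested_levels(block: bytes, moduli=(32, 64, 128, 256), v: int = 1):
    """Boundary sets from the finest level (smallest modulus) to the coarsest."""
    return [boundary_set(block, p, v) for p in moduli]
```

Each coarser level is a subset of the finer one, so every large subblock is an exact concatenation of smaller subblocks, which gives the tree structure described above.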

Multiple partitionings could also be useful simply to provide a wider pool of subblocks to compare. For example, it may be appropriate to partition each block of data in W different ways using W different functions F₁ . . . F_(W) where each function yields roughly the same subblock sizes, but at different positions within the block.

Another technique would be to create an additional set of boundaries based on the boundaries provided by a hash function. For example, a fractal algorithm could be used to partition a block based upon some other partitioning provided by a function F.

Comparing Subblocks

In most applications of this invention, there will be a need at some stage to identify identical subblocks. This can be done in a variety of ways:

Compare the subblocks themselves.

Compare the hashes of the subblocks.

Compare identities of the subblocks.

Compare references to the subblocks.

In most cases, the problem reduces to that of taking a group of subblocks of data and finding all subsets of identical subblocks. This is a well-solved problem and discussion of various solutions can be found in the following books:

Knuth D. E., "The Art of Computer Programming: Volume 1: Fundamental Algorithms", Addison Wesley, 1973.

Knuth D. E., "The Art of Computer Programming: Volume 3: Sorting and Searching", Addison Wesley, 1973.

In most cases, the problem is best solved by creating a data structure that maintains the subblocks, or references to the subblocks, in sorted order, and then inserting each subblock one at a time into the data structure. Not only does this identify all currently identical subblocks, but it also establishes a structure that can be used to determine quickly whether incoming subblocks are identical to any of those already held. The following data structures are described in the books referenced above and provide just a sample of the structures that could be used:

Hash tables.

Sorted trees (binary, N-ary, AVL).

Sorted linked lists.

Sorted arrays.

Of the multitude of solutions to the problem of matching blocks of data, one solution is worthy of special attention: the hash table. Hash tables consist of a (usually) finite array of slots into which values may be inserted. To add a value to a hash table, the value is hashed (using a hash function that is usually selected from the class of narrow hash functions) into a slot number, and the value is inserted into that slot. Later, the value can be retrieved in the same manner. Provisions must be made for the case where two data values, to be stored in the same table, hash to the same slot number.

Hash tables are likely to be of particular value in the implementation of this invention because:

They provide very fast (essentially constant time) access.

Many applications will need to calculate a strong one-way hash of each subblock, and a portion of this value can be used to index the hash table.

Particularly effective would be a hash table indexed by a portion of a strong one-way hash of the subblocks it stores, with each table entry containing (a) the strong one-way hash of the subblock, and (b) a pointer to the subblock stored elsewhere in memory.
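A minimal sketch of such a table follows. The slot number is taken from the leading bits of the strong hash, and each entry holds the full hash together with an index into a separate list that plays the role of the pointer to the subblock stored elsewhere. SHA-256, chaining within a slot, and the class name SubblockTable are assumptions of this illustration.

```python
import hashlib


class SubblockTable:
    """Hash table indexed by a portion of each subblock's strong one-way hash."""

    def __init__(self, slot_bits: int = 12):
        self.mask = (1 << slot_bits) - 1
        self.slots = [[] for _ in range(1 << slot_bits)]
        self.store = []  # subblocks kept elsewhere; entries refer to them by index

    def _slot(self, strong_hash: bytes) -> int:
        # Reuse part of the strong hash as the slot number.
        return int.from_bytes(strong_hash[:4], "big") & self.mask

    def insert(self, subblock: bytes):
        """Return (index, is_new); a duplicate subblock is stored only once."""
        strong_hash = hashlib.sha256(subblock).digest()
        chain = self.slots[self._slot(strong_hash)]
        for stored_hash, index in chain:
            if stored_hash == strong_hash:
                return index, False
        self.store.append(subblock)
        chain.append((strong_hash, len(self.store) - 1))
        return len(self.store) - 1, True
```

Entries that land in the same slot are kept in a short list, which is one way of making the provision for colliding slot numbers mentioned above.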

The Use of Compression, Encryption, and Integrity Techniques

Aspects of the invention could be enhanced by the use of data compression, data encryption, and data integrity techniques. The applications of these techniques include, but are not limited to, the following applications:

Any subblock that is transmitted or represented in its raw form could alternatively be transmitted or represented in a compressed or encrypted form.

Subblocks could be compressed and encrypted before further processing by aspects of this invention.

Blocks could be compressed and encrypted before further processing by aspects of this invention.

Communications or representations could be compressed or encrypted.

Any component could carry additional checking information such as checksums or digests of the data in the component.

Ad-hoc data compression techniques could be used to further compress references and identities or consecutive runs of references and identities.

Storage of Variable-Length Subblocks on Disk

The division of data into subblocks of varying length presents some storage organization problems (if the subblocks are to be stored independently of each other), as most hardware disk systems are organized to store an array of fixed-length blocks (e.g. one million 512-byte blocks) rather than variable-length ones. Here are some techniques that could be used to tackle this problem:

Each subblock could be stored in an integral number of disk blocks, with some part of the last disk block being wasted. For randomly sized subblocks, this scheme will waste on average half a disk block per subblock.

Create a small set of different bucket sizes (e.g. powers of two) and create arrays on the disk that pack collections of these buckets efficiently into the disk blocks. For example, if disk blocks were 512 bytes long, one could fairly efficiently pack five 200-byte buckets into an array of two disk blocks. Each subblock would be stored in the smallest bucket size that would hold the subblock, with the unused part of the bucket being wasted.

Treat the disk blocks as a vast array of bytes, and use well-established heap management techniques to manage the array. A sample of such techniques appears in pages 435-451 of the book:

Knuth D. E., "The Art of Computer Programming: Volume 1: Fundamental Algorithms", Addison Wesley, 1973.

The Use of Concurrency

Two processes are said to be concurrent if their execution takes place in some sense at the same time:

In interleaving concurrency, some or all of the operations performed by the two processes are interleaved in time, but the two processes are never both executing at exactly the same instant.

In genuine concurrency, some or all of the operations performed by the two processes are genuinely executed at the same instant.

Implementations of the present invention could incorporate either form of concurrency to various degrees. In most of the aspects of the invention, some subset of the steps of each aspect could be performed concurrently. In particular (without limitation):

A block could be split into parts and the parts partitioned concurrently.

The processing of subblocks defined during a sequential partitioning of a block need not be deferred until the entire block has been partitioned. In particular, the hashes of already-defined subblocks could be calculated and compared while further subblocks are being defined.

Communicating entities that decompose and compose blocks could execute concurrently.

Where more than one block must be partitioned for processing, such partitioning could be performed concurrently.

Many more forms of concurrency within aspects of this invention could be identified.

Example: Partitioning a Block

We now present a simple example of how a block might be partitioned in practice. Consider the following block of bytes:

b₁ b₂ b₃ b₄ b₅ b₆ b₇ b₈ b₉ . . .

In this example, an example hash function H will be used to partition the block. Boundaries will be represented by pairs such as b₆|b₇. We will assume that H returns a boolean value based on its argument and that a boundary is to be placed at each b_(i)|b_(i+1) for which H(b_(i-2), b_(i-1), b_(i)) evaluates to true.

As the hash function accepts 3-byte arguments, we start at b₃|b₄ and evaluate H(b₁, b₂, b₃). This turns out to be false (for the purposes of example), so we move to b₄|b₅ and evaluate H(b₂, b₃, b₄). This turns out to be true, so a boundary is placed at b₄|b₅. Next, we move to b₅|b₆ and evaluate H(b₃, b₄, b₅). This turns out to be false so we move on. H(b₄, b₅, b₆) is true so we place a boundary at b₆|b₇. This process continues until the end of the block is reached.

b₁ b₂ b₃ b₄ | b₅ b₆ | b₇ b₈ b₉ . . .

Some variations on this approach are:

Imposition of a lower bound L on subblock size by skipping ahead L bytes after placing a boundary.

Imposition of an upper bound U on subblock size by artificially placing a boundary if U bytes have been processed since the last boundary was placed.

Improving the efficiency of the hash calculations by using some part of the calculation of the hash of the bytes at one position to calculate the hash at the next position. For example, it may be more efficient to calculate H(x,y,z) if H(*,x,y) has already been calculated. The Internet IP checksum, for instance, is organized so that a single running checksum value can be maintained, with bytes entering the window being added to the checksum, and bytes exiting the window being subtracted from the checksum.

Applying this algorithm in reverse, starting from the end of the block and working backwards.

Finding the subblock that encloses a particular point (chosen from anywhere within the block) by exploring in both directions from the point, looking for the nearest boundary in each direction.

Finding all subblock boundaries in one step by evaluating F at all positions concurrently.
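The incremental calculation mentioned among the variations above can be illustrated with a plain additive checksum over a sliding window. This is a simplification chosen for clarity and is not the actual Internet checksum, which uses 16-bit ones' complement arithmetic. The name window_sums is hypothetical.

```python
def window_sums(block: bytes, width: int = 3):
    """Additive checksum of every `width`-byte window, updated incrementally."""
    if len(block) < width:
        return []
    running = sum(block[:width])
    sums = [running]
    for i in range(width, len(block)):
        # Add the byte entering the window and subtract the byte leaving it.
        running += block[i] - block[i - width]
        sums.append(running)
    return sums
```

Each step costs one addition and one subtraction regardless of the window width, instead of a fresh sum over the whole window.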

Example: Forming a Table of Hashes

Once a block has been partitioned, the hash of each subblock can be calculated to form a table of hashes (FIG. 24).

This table of hashes can be used to determine if a new subblock is identical to any of the subblocks whose hashes are in the table. To do this, the new subblock's hash is calculated and a check made to see if the hash is in the table.

In FIG. 24, the table of hashes looks like an array of hashes. However, the table of hashes could be stored in a wide variety of data structures (e.g. hash tables, binary trees).

Example Application: A File Comparison Utility

As the invention provides a new way of finding similarities between large volumes of data, it follows that it should find some application in the comparison of data.

In one aspect, the invention could be used to determine the broad similarities between two files being compared by a file comparison utility. The utility would partition each of the two files into subblocks, organize the hashes of the subblocks somehow (e.g. using a hash table) to identify all identical subblocks, and then use this information as a framework for reporting similarities and differences between the two files.
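The core of such a utility might look like the following sketch. It partitions both files with the example three-byte hash (exclusive or assumed between its terms, with ρ = 31 chosen arbitrarily to give small subblocks), indexes the MD5 digests of one file's subblocks in a dictionary, and reports the spans the files share. The names partition and common_subblocks, and the (offset, offset, length) result format, are hypothetical.

```python
import hashlib


def H(b1: int, b2: int, b3: int, p: int = 31) -> int:
    """Example narrow hash of three bytes, reduced modulo p."""
    return ((40543 * ((b1 << 8) ^ (b2 << 4) ^ b3)) >> 4) % p


def partition(block: bytes, p: int = 31, v: int = 1):
    """Cut the block wherever the three preceding bytes hash to v."""
    parts, start = [], 0
    for i in range(3, len(block)):
        if H(block[i - 3], block[i - 2], block[i - 1], p) == v:
            parts.append(block[start:i])
            start = i
    parts.append(block[start:])
    return parts


def common_subblocks(file_a: bytes, file_b: bytes):
    """Return (offset_in_a, offset_in_b, length) for each subblock of A found in B."""
    table, offset = {}, 0
    for sb in partition(file_b):
        # Remember the first position at which each digest occurs in B.
        table.setdefault(hashlib.md5(sb).digest(), offset)
        offset += len(sb)
    matches, offset = [], 0
    for sb in partition(file_a):
        hit = table.get(hashlib.md5(sb).digest())
        if hit is not None:
            matches.append((offset, hit, len(sb)))
        offset += len(sb)
    return matches
```

Because each boundary depends only on the three bytes before it, inserting data at the start of a file shifts the later subblocks without changing them, so they are still reported as common.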

In a similar aspect, the invention could be used to find similarities between the contents of large numbers of files in a file system. A utility incorporating the invention could read each file in an entire file system, partition each into subblocks and then insert the subblocks (or hashes of the subblocks) into one huge table (e.g. implemented by a hash table or a binary tree). If each entry in the table carried the name of the file containing it as well as the position of the subblock within the file, the table could later be used to identify those files containing identical portions of data.

If, in addition, a facility was added for recording and comparing the hashes of the entire contents of files and directory trees, a utility could be constructed that could identify all largely similar structures within a file system. Such a utility would be immensely useful when (say) attempting to merge the data on several similar backup tapes.

Example Application: A Fine-Grained Incremental Backup System

In a fine-grained incremental backup system, two entities E1 and E2 (e.g. two computers on a network) wish to repeatedly back up a file X at E1 such that the old version of the file Y at E2 will be updated to become a copy of the new version of the file X at E1 (without modifying X). The system could work as follows.

Each time E1 performs a backup operation, it partitions X into subblocks and writes the hashes of the subblocks to a shadow file S. It might also write a hash of the entire contents of X to the shadow file. After the backup has been completed, X will be the same as Y and so the shadow file S will correspond to both X and Y. Once X is again modified (during the normal operation of the computer system), S will correspond only to Y. S can then be used during the next backup operation.

To perform the backup, E1 compares the hash of Y (stored in S) against the hash of X to see if X has changed (it could also use the modification date file attribute of the file). If X hasn't changed, there is no need to perform any further backup action. If X has changed, E1 partitions X into subblocks, and compares the hashes of these subblocks with the hashes in the shadow file S, so as to find all identical hashes. Identical hashes identify identical subblocks in Y that can be transmitted by reference. E1 then transmits the file as a mixture of raw subblocks and references to subblocks whose hashes appear in S and which are therefore known to appear as subblocks in Y. E1 can also transmit references to subblocks already transmitted. References can take many forms including (without limitation):

A hash of the subblock.

The number of the subblock in the list of subblocks in Y.

The number of a subblock previously transmitted.

A range of any of the above.

Throughout this process, E1 can be constructing the new shadow file corresponding to X. FIG. 25 illustrates the backup process.

To reconstruct X from Y and D (the incremental backup information being sent from E1), E2 partitions Y into subblocks and calculates the hashes of the subblocks (it could do this in advance during the previous backup). It then processes the incremental backup information, copying subblocks that were transmitted raw and looking up the references either in Y or in the part of X already reconstructed.
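The exchange between E1 and E2 can be sketched with subblocks that have already been produced by some partitioning method. References take the form of the subblock's number in Y. The item encoding, a pair of a kind tag and a value, is a hypothetical stand-in and not the incremental backup file format. References to subblocks transmitted earlier in the same backup are omitted for brevity.

```python
import hashlib


def md5(data: bytes) -> bytes:
    return hashlib.md5(data).digest()


def make_delta(x_subblocks, shadow_hashes):
    """E1 side: encode X using only the hashes of Y's subblocks (the shadow file)."""
    number_in_y = {}
    for i, h in enumerate(shadow_hashes):
        number_in_y.setdefault(h, i)  # keep the first occurrence of each hash
    items = []
    for sb in x_subblocks:
        i = number_in_y.get(md5(sb))
        # Send a reference when Y is known to contain the subblock, raw data otherwise.
        items.append(("ref", i) if i is not None else ("raw", sb))
    return items


def apply_delta(y_subblocks, items) -> bytes:
    """E2 side: rebuild X from Y's subblocks and the transmitted items."""
    return b"".join(y_subblocks[value] if kind == "ref" else value
                    for kind, value in items)
```

Note that make_delta never reads Y itself, which is why E1 can prepare the incremental backup in isolation.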

Because information need only flow from E1 to E2 during the backup operation, there is no need for E1 and E2 to perform the backup operation concurrently. E1 can perform its side of the backup operation in isolation, producing an incremental backup file that can be later processed by E2.

There is a tradeoff between 1) the approximate ratio between the size of each file and that of its shadow, and 2) the mean subblock size. The higher the mean subblock size (as determined by the partitioning method used), the fewer subblocks per unit file length, and hence the shorter the shadow size per unit file length. However, increasing the mean subblock size implies increasing the granularity of backups, which can cause an increase in the size of the incremental backup file. There is also a tradeoff between the shadow file size and the hash width. A shadow file that uses 128-bit hashes will be about twice as long as one that uses 64-bit hashes. All these tradeoffs must be considered closely when choosing an implementation.

    ______________________________________
    Bytes  Description
    ______________________________________
    16     MD5 digest of the file Y corresponding to this shadow file.
    16     MD5 digest of the first subblock in Y.
    16     MD5 digest of the second subblock in Y.
    ..     . . .
    16     MD5 digest of the last subblock in Y.
    16     MD5 digest of the rest of this shadow file.
    ______________________________________

The first field contains the MD5 digest (a form of cryptographic hash) of the entire contents of Y. This is included so that it can be copied to the incremental backup file so as to provide a check later that the incremental backup file is not being applied to the wrong version of Y. It could also be used to determine if any change has been made to X since the previous backup Y was taken. The first field is followed by a list of the MD5 digests of the subblocks in Y in the order in which they appear in Y. Finally, a digest of the contents of the shadow file (less this field) is included at the end so as to enable the detection of any corruption of the shadow file.
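The shadow file layout described above might be built and checked along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and the subblocks of Y are assumed to have been produced already by whatever partitioning method is in use.

```python
import hashlib

def md5(data: bytes) -> bytes:
    return hashlib.md5(data).digest()   # 16-byte digest

def build_shadow_file(y: bytes, subblocks: list[bytes]) -> bytes:
    """Digest of Y, then one digest per subblock, then a digest of all of that."""
    body = md5(y) + b"".join(md5(s) for s in subblocks)
    return body + md5(body)

def parse_shadow_file(shadow: bytes) -> tuple[bytes, list[bytes]]:
    """Returns (digest of Y, list of subblock digests), checking the trailer."""
    if len(shadow) < 32 or len(shadow) % 16 != 0:
        raise ValueError("malformed shadow file")
    body, trailer = shadow[:-16], shadow[-16:]
    if md5(body) != trailer:
        raise ValueError("shadow file is corrupt")
    file_digest = body[:16]
    hashes = [body[i:i + 16] for i in range(16, len(body), 16)]
    return file_digest, hashes
```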

The format of the incremental backup file is as follows:

    ______________________________________
    Bytes   Description
    ______________________________________
    16      MD5 digest of Y.
    16      MD5 digest of X.
    ..      Zero or more ITEMS.
    16      MD5 digest of the rest of the incremental backup file.
    ______________________________________

The first two fields of the incremental backup file contain the MD5 digests of the old and new versions of the file. The hash of the new version X is calculated directly from X. The hash of the old version is obtained from the first field of the shadow file. These two values enable the remote backup entity E2 to check that:

The backup file Y (to be updated) is identical to the one from which the shadow file was generated.

The reconstructed X is identical to the original X.

The two checking fields are followed by a list of items followed by a checking digest of the rest of the incremental backup file.

Each item in the list of items describes one or more subblocks in the list of subblocks that can be considered to constitute X. There are three kinds of item, and each item commences with a byte having a value one, two, or three to indicate the kind of item. Here is a description of the content of each of the three kinds of item:

1. The 32-bit index of a subblock in Y. Because E2 possesses Y, it can partition Y itself to construct the same partitioning that was used to create the shadow file. Thus E1 doesn't need to send the hash of any subblock that is in both X and Y. Instead, it need only send the index of the subblock in the list of subblocks constituting Y. This list is represented by the list of hashes in the shadow file S. As 32 bits is wide enough for an index in practice, the saving gained by communicating a 32-bit index instead of a 128-bit hash is 96 bits for each such item.

2. A pair of 32-bit numbers being the index of the first and last subblock of a range of subblocks in Y. Old and new versions of files often share large contiguous ranges of subblocks. The use of this kind of item allows such ranges to be represented using just 64 bits instead of a long run of instances of the first kind of item.

3. A 32-bit value containing the number of bytes in the subblock, followed by the raw content of the subblock. This kind of item is used if the subblock to be transmitted is not present in Y.

In the implementation, all the values are coded in little-endian form. Big-endian could be used equally well.
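The three kinds of item might be serialized as in the following sketch, which uses Python's `struct` module with little-endian, unpadded fields. The function names are assumptions made for illustration, and the range in the second kind of item is taken to be inclusive of both the first and last index.

```python
import struct

def encode_index(i: int) -> bytes:
    """Kind 1: kind byte, then the 32-bit index of a subblock in Y."""
    return struct.pack("<BI", 1, i)

def encode_range(first: int, last: int) -> bytes:
    """Kind 2: kind byte, then 32-bit first and last indexes of a range in Y."""
    return struct.pack("<BII", 2, first, last)

def encode_raw(subblock: bytes) -> bytes:
    """Kind 3: kind byte, 32-bit byte count, then the raw subblock."""
    return struct.pack("<BI", 3, len(subblock)) + subblock

def decode_items(data: bytes) -> list[tuple]:
    """Parses a concatenation of items back into tuples."""
    items, pos = [], 0
    while pos < len(data):
        kind = data[pos]
        pos += 1
        if kind == 1:
            (i,) = struct.unpack_from("<I", data, pos)
            pos += 4
            items.append((1, i))
        elif kind == 2:
            first, last = struct.unpack_from("<II", data, pos)
            pos += 8
            items.append((2, first, last))
        elif kind == 3:
            (n,) = struct.unpack_from("<I", data, pos)
            pos += 4
            items.append((3, data[pos:pos + n]))
            pos += n
        else:
            raise ValueError(f"unknown item kind {kind}")
    return items
```

An item of the first kind occupies 5 bytes and an item of the second kind 9 bytes, regardless of how many subblocks the range covers.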

The existing implementation could be further optimized by (without limitation):

Adding an additional kind of item that refers to subblocks in X already transmitted;

Adding an additional kind of item that refers to ranges of subblocks in X already transmitted;

Employing data compression techniques to compress the raw blocks in the third kind of item;

Using the first hash in the shadow file to check to see if the entire file has changed at all before performing the backup process described above;

Replacing hashes in S of subblocks in Y by references to other hashes in S (where the hashes (and hence subblocks) are identical). Repeated runs of hashes could also be replaced by pointers to ranges of hashes.

The scheme described above has been described in terms of a single file. However, the technique could be applied repeatedly to each of the files in a file system, thus providing a way to back up an entire file system. The shadow information for each file in the file system could be stored inside a separate shadow file for each file, or in a master shadow file containing the shadows for one or more (or all) files in the file system.

Although most redundancy in a file system is likely to be found within different versions of each file, there may be great similarities between versions of different files. For example, if a file is renamed, the "new" file will be identical to the "old" file. Such redundancy can be catered for by comparing the hashes of all the files in the old and new versions of a file system. In addition, similarities between different parts of different files can be exploited by comparing the hashes of subblocks of each file to be backed up with the hashes of the subblocks of the entire old version of the file system.

If E2 has lots of space, a further improvement could be for E1 to retain the shadows of all the previous versions of the file system, and for E2 to retain copies of all the previous versions of the file system. E1 could then refer to every block it has ever seen. This technique could also be applied on a file-by-file basis.

In a further variant, the dependence on the ordering of subblocks could be abandoned, and E1 could simply keep a shadow file containing a list of the hashes of all the subblocks in the previous version (or versions) of the file or file system. E2 would then need to record only a single copy of each unique subblock it has ever received from E1.

Aspects of the backup application described in this section can be integrated cleanly into existing backup architecture by deploying the new mechanisms within the framework of existing ones. For example, the traditional methods for determining if a file has changed since the last backup (modification date, backup date and so on) can be used to see if a file needs to be backed up at all, before applying the new mechanisms.

Example Application: A Low-Redundancy File System

We now present an example of a low-redundancy file system that attempts to avoid storing different instances of the same data more than once. In this example, the file system is organized as shown in FIG. 26.

The bottom layer consists of a collection of unique subblocks of varying length that are stored somewhere on the disk. The middle layer consists of a hash table containing one entry for each subblock. Each entry consists of a cryptographic hash of the subblock, a reference count for the subblock, and a pointer to the subblock on disk. The hash table is indexed by some part of the cryptographic hash (e.g. the bottom 16 bits). Although a hash table is used in this example, many other data structures (e.g. a binary tree) could also be used to map cryptographic hashes to subblock entries. It would also be possible to index the subblocks directly without the use of cryptographic hashes.

The top layer consists of a table of files that binds filenames to lists of subblocks, each list being a list of indexes into the hash table. The reference count in each hash table entry records the number of references to the subblock that appear in the entire set of files in the file table. The issue of hash table "overflow" can be addressed using a variety of well-known overflow techniques such as that of attaching a linked list to each hash slot.

When a file is read, the list of hash table indexes is converted to pointers to subblocks of data using the hash table. If random access to the file is required, extra information about the length of the subblocks could be added to the file table and/or hash table so as to speed access.

Writing a file is more complicated. During a sequential write, the data being written is buffered until a subblock boundary is reached (as determined by whatever boundary function is being used). The cryptographic hash of the new subblock is then calculated and used to look up the hash table. If the subblock is unique (i.e. there is no entry for the cryptographic hash), it is added to the data blocks on the disk and an entry is added to the hash table. A new subblock number is added to the list of blocks in the file table. If, on the other hand, the subblock already exists, the subblock need not be written to disk. Instead, the reference count of the already-existing subblock is incremented, and the subblock's hash table index is added to the list of blocks in the file's entry in the file table.
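The three-layer organization and the sequential read and write paths described above might be modelled in memory as in the following sketch. It simplifies in several labeled ways: a Python dictionary keyed by the full MD5 digest stands in for the hash table and its overflow handling, subblocks are held in memory rather than on disk, the caller supplies data already partitioned into subblocks, and the class and method names are hypothetical.

```python
import hashlib

class LowRedundancyStore:
    def __init__(self):
        self.entries = {}   # hash -> [reference count, subblock contents]
        self.files = {}     # filename -> ordered list of hashes

    def write_file(self, name: str, subblocks: list[bytes]) -> None:
        if name in self.files:
            self.delete_file(name)            # release the old contents first
        keys = []
        for sub in subblocks:
            h = hashlib.md5(sub).digest()
            entry = self.entries.get(h)
            if entry is None:
                self.entries[h] = [1, sub]    # unique subblock: store it once
            else:
                entry[0] += 1                 # duplicate: only count the reference
            keys.append(h)
        self.files[name] = keys

    def read_file(self, name: str) -> bytes:
        return b"".join(self.entries[h][1] for h in self.files[name])

    def delete_file(self, name: str) -> None:
        for h in self.files.pop(name):
            entry = self.entries[h]
            entry[0] -= 1
            if entry[0] == 0:
                del self.entries[h]           # no file refers to this subblock
```

Two files that share a subblock cause that subblock to be stored once with a reference count of two; deleting one of the files reduces the count to one without touching the stored data.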

Random access writes are more involved, but essentially the same principles apply.

If a record were kept of subblocks created since the last backup, backing up this file system could be very efficient indeed.

One enhancement that could be made is to exploit unused disk space. Instead of automatically ignoring or overwriting subblocks whose reference count has dropped to zero, the low-redundancy file system could move them to a pool of unused subblocks. These subblocks, while not present in any file, could still form part of the subblock pool referred to when checking to see if incoming subblocks are already present in the file system. The space consumed by subblocks in the unused subblock pool would be recycled only when the disk was full. In the steady state, the "unused" portion of the disk would be filled by subblocks in the unused subblock pool.

Although this section has specifically described a low-redundancy file system, this aspect of the invention is really a general purpose storage system that could be applied at many levels and in many roles in information processing systems. For example:

The technique could be used to implement a low-redundancy virtual memory system. The contents of memory could be organized as a collection of subblocks.

The technique could be used to increase the efficiency of an on-chip cache.

Example Application: A Communication System

A method is now presented for reducing redundant transmissions in communications systems. Consider two entities E1 and E2, where E1 must transfer a block of data X to E2. E1 and E2 need never have communicated previously with each other.

The conventional way to perform the transmission is simply for E1 to transmit X to E2. However, here E1 first partitions X into subblocks and calculates the hash of each subblock using a hash function. It then transmits the hashes to E2. E2 then looks up the hashes in a table of hashes of all the subblocks it already possesses. E2 then transmits to E1 information (e.g. a list of subblock numbers) identifying the subblocks in X that E2 does not already possess. E1 then transmits just those subblocks.
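The three-message exchange described above might be simulated as in the following sketch. The `Receiver` class, the `send` function, and the in-process method calls that stand in for network messages are all assumptions made for illustration; X is assumed to have been partitioned into subblocks already.

```python
import hashlib

def digest(subblock: bytes) -> bytes:
    return hashlib.md5(subblock).digest()

class Receiver:
    """Plays the part of E2."""
    def __init__(self, known_subblocks=()):
        self.table = {digest(s): s for s in known_subblocks}

    def missing(self, hashes: list[bytes]) -> list[int]:
        """Numbers of the subblocks of X that E2 lacks (each distinct one once)."""
        wanted, seen = [], set()
        for i, h in enumerate(hashes):
            if h not in self.table and h not in seen:
                seen.add(h)
                wanted.append(i)
        return wanted

    def receive(self, hashes: list[bytes], supplied: dict[int, bytes]) -> bytes:
        for sub in supplied.values():
            self.table[digest(sub)] = sub
        return b"".join(self.table[h] for h in hashes)

def send(subblocks: list[bytes], receiver: Receiver) -> tuple[bytes, int]:
    """Plays the part of E1; returns (block as rebuilt by E2, raw bytes sent)."""
    hashes = [digest(s) for s in subblocks]           # message 1: E1 -> E2
    wanted = receiver.missing(hashes)                 # message 2: E2 -> E1
    supplied = {i: subblocks[i] for i in wanted}      # message 3: E1 -> E2
    rebuilt = receiver.receive(hashes, supplied)
    return rebuilt, sum(len(s) for s in supplied.values())
```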

Another way to perform the transaction would be for E2 to first transmit to E1 the hashes of all the subblocks it possesses (or perhaps a well chosen subset of them). E1 could then transmit references to subblocks in X already known to E2 and the actual contents of subblocks in X not known to E2. This scheme could be more efficient than the earlier scheme in cases where E2 possesses fewer subblocks than there are in X.

Another way to perform the transaction would be for E1 and E2 to conduct a more complicated conversation to establish which subblocks E2 possesses. For example, E2 could send E1 the hashes of just some of the subblocks it possesses (perhaps the most popular ones). E1 could then send to E2 the hashes of other subblocks in X. E2 could then reply indicating which of those subblocks it truly does not possess. E1 could then send to E2 the subblocks in X not possessed by E2.

In a more sophisticated system, E1 and E2 could keep track of the hashes of the subblocks possessed by the other. If either entity ever sent (for whatever reason) a reference to a subblock not possessed by the other entity, the latter entity could simply send back a request for the subblock to be transmitted explicitly and the former entity could send the requested subblock.

The communication application described above considers the case of just two communicants. However, there is no reason why the scheme could not be generalized to cover more than two communicants communicating with each other in private and in public (using broadcasts). For example, to broadcast a block, a computer C₁ could broadcast a list of the hashes of the block's subblocks. Computers C₂ . . . C_(N) could then each reply indicating which subblocks they do not already possess. C₁ could then broadcast subblocks that many of the other computers do not possess, and send the subblocks missing from only a few computers to those computers privately.

All these techniques have the potential to greatly reduce the amount of information transmitted between computers.

These techniques would be very efficient if they were implemented on top of the file system described earlier, as the file system would already have performed the work of organizing all the data it possesses into indexed subblocks. The potential savings in communication that could be made if many different computer systems shared the same subblock partitioning algorithm suggests that some form of universal standardization on a particular partitioning method would be a worthy goal.

Example Application: A Subblock Server

Aspects of the invention could be used to establish a subblock server on a network so as to reduce network traffic. A subblock server could be located in a busy part of a network. It would consist of a computer that breaks each block of data it sees into subblocks, hashes the subblocks, and then stores them for future reference. Other computers on the network could send requests to the server for subblocks, the requests consisting of the hashes of subblocks the server might possess. The server would respond to each hash, returning either the subblock corresponding to the hash, or a message stating that the server does not possess a subblock corresponding to the hash.
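The request and response behaviour of such a server might be sketched as follows. The class and method names are hypothetical, the store is an in-memory dictionary, and a return value of `None` stands in for the message stating that the server does not possess the subblock.

```python
import hashlib
from typing import Optional

class SubblockServer:
    def __init__(self):
        self._store = {}   # hash -> subblock

    def observe(self, subblocks: list[bytes]) -> None:
        """Archive the subblocks of a block the server has seen."""
        for sub in subblocks:
            self._store[hashlib.md5(sub).digest()] = sub

    def lookup(self, subblock_hash: bytes) -> Optional[bytes]:
        """Return the subblock for this hash, or None if it is not possessed."""
        return self._store.get(subblock_hash)

    def fetch(self, hashes: list[bytes]) -> tuple[dict, list]:
        """Answer a batch of requests: (found subblocks by hash, hashes not held)."""
        found, absent = {}, []
        for h in hashes:
            sub = self.lookup(h)
            if sub is None:
                absent.append(h)
            else:
                found[h] = sub
        return found, absent
```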

Such a subblock server could be useful for localizing network traffic on the Internet. For example, if a subnetwork (even a large one for (say) an entire country) placed a subblock server on each of its major Internet connections, then (with the appropriate modification of various protocols) much of the traffic into the network could be eliminated. For example, if a user requested a file from a remote host on another network, the user's computer might issue the request and receive, in reply, not the file, but the hashes of the file's subblocks. The user's computer could then send the hashes to the local subblock server to see if the subblocks are present there. It would receive the subblocks that are present and then forward a request for the remaining subblocks to the remote host. The subblock server might notice the new subblocks flowing through it and archive them for future reference. The overall effect could be to eliminate most repeated data transfers between the subnetwork and the rest of the Internet. However, the security implications of schemes such as these would need to be closely investigated before they were deployed.

A further step could be to create "virtual" subblock servers that store the hashes of subblocks and their location on the Internet rather than the subblocks and their hashes.

I claim:
 1. A method for organizing a block b of digital data for storage, communication, or comparison, comprising the step of: partitioning said block b into a plurality of subblocks at at least one position k|k+1 within said block, for which b[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers.
 2. The method of claim 1, wherein the constraint comprises the hash of at least a portion of b[k-A+1 . . . k+B].
 3. The method of claim 1, further comprising the step of: locating the nearest subblock boundary on a side of a position p|p+1 within said block, said locating step comprising the step of: evaluating whether said predetermined constraint is satisfied at each position k|k+1 for increasing or decreasing k, wherein k starts with the value p.
 4. The method of claim 1, wherein at least one bound is imposed on the size of at least one of said plurality of subblocks.
 5. The method of claim 1, wherein additional subblocks are formed from at least one group of subblocks.
 6. The method of claim 1, wherein an additional hierarchy of subblocks is formed from at least one group of contiguous subblocks.
 7. The method of claim 1, further comprising the step of: calculating the hash of each of at least one of said plurality of subblocks.
 8. The method of claim 1, further comprising the step of: forming a projection of said block, being an ordered or unordered collection of elements, wherein each element consists of a subblock, an identity of a subblock, or a reference of a subblock.
 9. The method of claim 1, wherein said subblocks are compared by comparing the hashes of said subblocks.
 10. The method of claim 1, wherein subsets of identical subblocks within a group of one or more subblocks are found by inserting each subblock, an identity of each subblock, a reference of each subblock, or a hash of each subblock into a data structure.
 11. A method for comparing one or more blocks, comprising the steps of: organizing a block b of digital data for the purpose of comparison, comprising the step of: partitioning said block b into a plurality of subblocks at at least one position k|k+1 within said block; for which b[k-A+1 . . . k+B] satisfies a predetermined constraint; and wherein A and B are natural numbers, forming a projection of each said block, being a collection of elements, wherein each element comprises a selected one of a subblock, an identity of a subblock, and a reference of a subblock, and comparing the elements of said projections of said blocks.
 12. A method for representing one or more blocks comprising a collection of subblocks and block representatives which are mapped to lists of entries which identify subblocks; said method comprising the step of modifying one of said blocks including the steps of: partitioning said block into a plurality of subblocks at at least one position k|k+1 within said block, for which b[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, adding to said collection of subblocks zero or more subblocks which are not already in said collection, and updating said subblock list associated with said modified block.
 13. A method for representing one or more blocks comprising a collection of subblocks and block representatives which are mapped to lists of entries which identify subblocks; said method comprising the step of modifying one of said blocks including the steps of: partitioning said block into a plurality of subblocks at at least one position k|k+1 within said block, for which b[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, removing from said collection of subblocks zero or more subblocks, and updating said subblock list associated with said modified block.
 14. A method for representing one or more blocks comprising a collection of subblocks and block representatives which are mapped to lists of entries which identify subblocks; said method comprising the step of modifying one of said blocks including the steps of: partitioning said block into a plurality of subblocks at at least one position k|k+1 within said block, for which b[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, adding to said collection of subblocks zero or more subblocks that are not already in the collection, removing from said collection of subblocks zero or more subblocks, and updating said subblock list associated with said modified block.
 15. A method for an entity E1 to communicate a block X to E2 where E1 possesses the knowledge that E2 possesses a group of Y subblocks Y₁ . . . Y_(m), comprising the steps of: partitioning said block X into a plurality of subblocks X₁ . . . X_(n) at at least one position k|k+1 within said block, for which X[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, and transmitting from E1 to E2 the contents of zero or more subblocks in X₁ and the remaining subblocks as references to subblocks in Y₁ . . . Y_(m), and to subblocks transmitted.
 16. A method for an entity E1 to communicate one or more subblocks of a group X of subblocks X₁ . . . X_(n) to E2 where E1 possesses the knowledge that E2 possesses a block Y, comprising the steps of: partitioning said block Y into a plurality of subblocks Y₁ . . . Y_(m) at at least one position k|k+1 within said block, for which Y[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, and transmitting from E1 to E2 the contents of zero or more subblocks in X, and the remaining subblocks as references to subblocks in Y, and to subblocks already transmitted.
 17. A method for an entity E1 to communicate a block X to E2 where E1 possesses the knowledge that E2 possesses a block Y, comprising the steps of: partitioning said block X into a plurality of subblocks X₁ . . . X_(n) at at least one position k|k+1 within said block, for which X[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, partitioning said block Y into a plurality of subblocks Y₁ . . . Y_(m) at at least one position k|k+1 within said block, for which Y[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, and transmitting from E1 to E2 the contents of zero or more subblocks in X, and the remaining subblocks as references to subblocks in Y, and to subblocks already transmitted.
 18. A method for constructing a block D from a block X and a group Y of subblocks Y₁ . . . Y_(m) such that X can be constructed from Y and D, comprising the steps of: partitioning said block X into a plurality of subblocks X₁ . . . X_(n) at at least one position k|k+1 within said block, for which X[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, and constructing D from a selected at least one of: the contents of zero or more subblocks in X, references to zero or more subblocks in Y, and references to zero or more subblocks in D.
 19. A method for constructing a block D from a group X of subblocks X₁ . . . X_(n) and a block Y such that X can be constructed from Y and D, comprising the steps of: partitioning said block Y into a plurality of subblocks Y₁ . . . Y_(m) at at least one position k|k+1 within said block, for which Y[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, and constructing D from a selected at least one of: the contents of zero or more subblocks in X, references to zero or more subblocks in Y, and references to zero or more subblocks in D.
 20. A method for constructing a block D from a block X and a block Y such that X can be constructed from Y and D, comprising the steps of: partitioning said block X into a plurality of subblocks X₁ . . . X_(n) at at least one position k|k+1 within said block, for which X[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, partitioning said block Y into a plurality of subblocks Y₁ . . . Y_(m) at at least one position k|k+1 within said block, for which Y[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, and constructing D from a selected at least one of: the contents of zero or more subblocks in X, references to zero or more subblocks in Y, and references to zero or more subblocks in D.
 21. A method for constructing a block D from a block X and a projection Y, said projection comprising a collection of elements wherein said elements comprise a subblock in Y, an identity of a subblock in Y, or a reference of a subblock in Y, such that X can be constructed from Y and D, comprising the steps of: partitioning said block X into a plurality of subblocks X₁ . . . X_(n) at at least one position k|k+1 within said block, for which X[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, and constructing D from a selected at least one of: the contents of zero or more subblocks in X, references to zero or more subblocks in Y, and references to zero or more subblocks in D.
 22. A method for constructing a block X from a block Y and a block D, comprising the steps of: partitioning said block Y into a plurality of subblocks Y₁ . . . Y_(m) at at least one position k|k+1 within said block, for which Y[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, and constructing X from D and Y by constructing the subblocks of X based on a selected at least one of: subblocks contained within D, references in D to subblocks in Y, and references in D to subblocks in D.
 23. A method for constructing a group X of subblocks X₁ . . . X_(n) from a block Y and a block D, comprising the steps of: partitioning said block Y into a plurality of subblocks Y₁ . . . Y_(m) at at least one position k|k+1 within said block, for which Y[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, and constructing X₁ . . . X_(n) from D and Y based on a selected at least one of: subblocks contained within D, references in D to subblocks in Y, and references in D to subblocks in D.
 24. A method for communicating a data block X from one entity E1 to another entity E2, comprising the steps of: partitioning said block X into a plurality of subblocks X₁ . . . X_(n) at at least one position k|k+1 within said block, for which X[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, transmitting from E1 to E2 an identity of at least one subblock, transmitting from E2 to E1 information communicating the presence or absence of subblocks at E2, and transmitting from E1 to E2 at least the subblocks identified as not being present at E2.
 25. A method for communicating a block X from one entity E1 to another entity E2, comprising the steps of: partitioning said block X into a plurality of subblocks X₁ . . . X_(n) at at least one position k|k+1 within said block, for which X[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, transmitting from E2 to E1 information communicating the presence or absence at E2 of members of a group Y of subblocks Y₁ . . . Y_(m), and transmitting from E1 to E2 the contents of zero or more subblocks in X, and the remaining subblocks as references to subblocks in Y₁ . . . Y_(m) and to subblocks already transmitted.
 26. A method for an entity E2 to communicate to an entity E1 the fact that E2 possesses a block Y, comprising the steps of: partitioning said block Y into a plurality of subblocks Y₁ . . . Y_(m) at at least one position k|k+1 within said block, for which Y[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, and transmitting from E2 to E1 references of the subblocks Y₁ . . . Y_(m).
 27. A method for an entity E1 to communicate a subblock X₁ to an entity E2, comprising the steps of: partitioning said block X into a plurality of subblocks X₁ . . . X_(n) at at least one position k|k+1 within said block, for which X[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers, transmitting from E2 to E1 an identity of X_(i), transmitting X_(i) from E1 to E2.
 28. An apparatus for organizing a block b of digital data for storage, communication, or comparison, comprising means for partitioning said block b into a plurality of subblocks at at least one position k|k+1 within said block, for which b[k-A+1 . . . k+B] satisfies a predetermined constraint, and wherein A and B are natural numbers.
 29. The apparatus of claim 28, in which the constraint comprises the hash of some or all of b[k-A+1 . . . k+B].
 30. The apparatus of claim 28, further comprising means for locating the nearest subblock boundary on a side of a position p|p+1 within said block, said means for locating comprising: means for evaluating whether said predetermined constraint is satisfied at each position k|k+1 for increasing or decreasing k, wherein k starts with the value p.